This chapter studies the problem of time-series classification and presents
an overview of recent developments in feature extraction and information
fusion. In particular, a recently proposed feature extraction algorithm,
symbolic dynamic filtering (SDF), is reviewed. The SDF algorithm uses
probabilistic finite state automata to generate low-dimensional feature
vectors that are well suited for discriminative tasks. The chapter also
presents recent developments in sparse-representation-based algorithms for
multimodal classification. These include the joint sparse representation,
which enforces collaboration across all the modalities, and the
tree-structured sparsity, which provides a flexible framework for fusion of
modalities at multiple granularities. Furthermore, unsupervised and
supervised dictionary learning algorithms are reviewed. The performance of
the algorithms is evaluated on a set of field data collected with passive
infrared and seismic sensors.


1.  Introduction 

Unattended ground sensor (UGS) systems have been extensively used to
monitor human activities for border security and target classification.
Seismic and passive infrared (PIR) sensors are typically used for this
purpose and are common in target detection. Nevertheless, discrimination of
different types of targets from footstep signals is still a challenging
problem due to environmental noise sources and the locality of the
sensors.1 This study addresses target classification, and more generally
time-series classification, in two main directions: feature extraction and
information fusion.

Several feature extraction methods have been proposed to generate
discriminative patterns from time-series data. These include the kurtosis
algorithm,2 and Fourier and wavelet analysis.3-5 Recently, a symbolic dynamic
filtering (SDF) algorithm has been proposed for feature extraction from
time-series data and has shown promising results in several applications,
including robot motion classification and target classification.6-8 In the
SDF algorithm, the time-series data are first converted into symbol
sequences, and then probabilistic finite-state automata (PFSA) are
constructed from these symbol sequences to compress the pertinent
information into low-dimensional statistical patterns.9 The advantages of
the SDF algorithm are that it captures the local information of the signal
and that it is capable of mitigating noise. In this chapter, the SDF
algorithm is briefly reviewed and is subsequently used for feature
extraction from PIR and seismic sensors.

This chapter also studies information fusion algorithms, which are then
used to integrate the information of the PIR and seismic modalities. As has
been widely studied, information fusion often results in better situation
awareness and decision making.10,11 For this purpose, several sparsity
models are discussed that provide a framework for feature-level fusion.
Feature-level fusion12 is a relatively less-studied topic compared to
decision-level fusion,13,14 mainly due to the difficulty of fusing
heterogeneous feature vectors. However, as will be shown, structured
sparsity priors can be used to overcome this difficulty and achieve
state-of-the-art performance. In particular, the joint sparse
representation15 is presented, which enforces collaboration across all the
modalities. The tree-structured sparse representation16,17 is also
presented, which allows fusion of different modalities at multiple
granularities. Moreover, unsupervised and supervised dictionary learning
algorithms are discussed for training more compact dictionaries that are
optimized for reconstructive or discriminative tasks, respectively.18,19

The rest of this chapter is organized as follows. Section 2 briefly
describes the SDF-based feature extraction. Section 3 discusses the
sparsity-based models for single-modal and multimodal classification, and
Section 4 presents the dictionary learning algorithms. Section 5 provides a
comparative study of the performance of the discussed algorithms for target
classification, followed by conclusions and recommendations for future
research in Section 6.

2.  Symbolic  dynamic  filtering  for  feature  extraction  from  time-series  data

An important step in time-series classification is feature extraction from
sensor signals. This step can be performed using different signal processing
tools, including principal component analysis,20 cepstrum, wavelet
analysis,22 and SDF.6,23 It has recently been shown that SDF can improve
classification performance by compressing the information within a
time-series window into a low-dimensional feature space while preserving the
discriminative pattern of the data. Applications include target detection
and classification9,24 and prediction of lean blowout phenomena in confined
combustion.25 Detailed information and different versions of SDF can be
found in earlier publications;23,26 this section briefly reviews the version
of the algorithm that is used later for feature extraction.

The algorithm consists of a few steps. The signal space, which is
approximated by the training samples, is first partitioned into a finite
number of cells that are labeled as symbols. Then, PFSA are formed to
represent different combinations of blocks of symbols on the symbol
sequence. Finally, for a given time series, the state transition probability
matrix is generated and the SDF feature is extracted. The pertinent steps of
this procedure are described below.

Let $\{u_1, \ldots, u_N\}$ be the set of $N$ training time series, where
$u_j \in \mathbb{R}^{1 \times M}$, $j = 1, \ldots, N$, consists of $M$
consecutive data points. Let $\Sigma$ be a set of finitely many symbols, also
known as the alphabet, with cardinality $|\Sigma|$. In the partitioning
step, the ensemble of training samples is divided into $|\Sigma|$ mutually
exclusive and exhaustive cells. Each disjoint region forms a cell in the
partitioning and is labeled with a symbol from the alphabet $\Sigma$.
Consequently, each sample of a time series is located in a particular cell
and is coded with the corresponding symbol, which results in a string of
symbols representing the (finite-length) time series. There are at least
two ways to perform the partitioning task: maximum entropy partitioning and
uniform partitioning.26 The maximum entropy partitioning algorithm maximizes
the entropy of the generated symbols and results in an (approximately) equal
number of data points in each cell. Therefore, the information-rich regions
of a data set are partitioned finer and those with sparse information are
partitioned coarser. In contrast, the uniform partitioning method results in
equal-sized cells. Maximum entropy partitioning is adopted here. The choice
of alphabet size $|\Sigma|$ largely depends on the specific data set and can
be set using cross-validation. A smaller alphabet size results in a more
compressed feature vector that is more robust to time-series noise, at the
cost of more information loss.
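
As an illustration of the partitioning step, the following minimal Python
sketch places the cell boundaries at equally spaced quantiles of the pooled
training samples, which is one common way to realize maximum entropy
partitioning for scalar signals. The function names, the use of NumPy, and
the synthetic Gaussian data are assumptions made for this example; they are
not taken from the cited implementations.

```python
import numpy as np

def max_entropy_partition(train, alphabet_size):
    """Return the alphabet_size - 1 interior cell boundaries.

    Boundaries are equally spaced quantiles of the pooled training samples,
    so each cell holds roughly the same number of data points.
    """
    pooled = np.concatenate([np.ravel(u) for u in train])
    probs = np.arange(1, alphabet_size) / alphabet_size
    return np.quantile(pooled, probs)

def symbolize(series, boundaries):
    """Map each sample to the index (0 .. alphabet_size - 1) of its cell."""
    return np.searchsorted(boundaries, np.ravel(series), side="right")

# Synthetic stand-in for the training time series (assumption).
rng = np.random.default_rng(0)
train = [rng.standard_normal(1000) for _ in range(5)]

boundaries = max_entropy_partition(train, alphabet_size=4)
symbols = symbolize(train[0], boundaries)

# Number of pooled training samples that fall in each cell.
counts = np.bincount(
    np.concatenate([symbolize(u, boundaries) for u in train]), minlength=4
)
```

With this construction the per-cell counts are nearly equal on the training
ensemble, whereas uniform partitioning would instead space the boundaries
evenly between the minimum and maximum of the data.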

In the next step, a PFSA is used to capture the information of a given
time series. The PFSA states represent different combinations of blocks of
symbols on the symbol sequence, and the transition from one state to another
is governed by a transition probability matrix. The “states” therefore
denote all possible symbol blocks within a window of a certain length. For
both algorithmic simplicity and computational efficiency, the D-Markov
machine structure6 has been adopted for construction of the PFSA. It is noted
that D-Markov machines form a proper subclass of hidden Markov models (HMM)
and have been experimentally validated for applications in various fields
of research (e.g., anomaly detection and robot motion classification8). A
D-Markov chain is modeled as a statistically locally stationary stochastic
process $S = \cdots s_{-1} s_0 s_1 \cdots$, where the probability of
occurrence of a new symbol depends only on the last $D$ symbols, i.e.,

$$P[s_n \mid \cdots s_{n-D} \cdots s_{n-1}] = P[s_n \mid s_{n-D} \cdots s_{n-1}].$$

Words of length $D$ on a symbol string are treated as the states of the
D-Markov machine before any state-merging is executed. The set of all
possible states is denoted as $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, where
$|Q|$ is the number of (finitely many) states and $|Q| \le |\Sigma|^D$.
Here, a D-Markov machine with symbol block length $D = 1$ for each state is
used, so the number of states equals the number of symbols, i.e.,
$|Q| = |\Sigma|$. The transition probabilities are defined as:

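$$P(q_k \mid q_l) = \frac{S(q_l, q_k)}{\sum_{i=1}^{|Q|} S(q_l, q_i)}, \quad \forall q_k, q_l \in Q, \tag{1}$$

where $S(q_l, q_k)$ is the total count of events when $q_k$ occurs adjacent
to $q_l$ in the direction of motion. Consequently, the state transition
probability matrix of the PFSA is given as

$$\Pi = \begin{bmatrix} P(q_1 \mid q_1) & \cdots & P(q_{|Q|} \mid q_1) \\ \vdots & \ddots & \vdots \\ P(q_1 \mid q_{|Q|}) & \cdots & P(q_{|Q|} \mid q_{|Q|}) \end{bmatrix}. \tag{2}$$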

With an appropriate choice of partitioning, the stochastic matrix $\Pi$ is
irreducible and the Markov chain is ergodic, i.e., the probability of every
state being reachable from any other state within finitely many transitions
is strictly positive under statistically stationary conditions.27

For a given time-series window, the SDF feature is then constructed as the
left eigenvector $p$ corresponding to the unity eigenvalue of the stochastic
matrix $\Pi$. It should be noted that $\Pi$, being irreducible, is
guaranteed to have a unique unity eigenvalue. The extracted feature vector
is the stationary state probability vector.
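
The feature construction for $D = 1$ can be sketched in a few lines of
Python. This is a minimal illustration, assuming that the symbols are
already integer-coded as 0, ..., $|\Sigma| - 1$; the function names, the toy
symbol sequence, and the uniform fallback for states that are never left are
assumptions of this example rather than part of the reviewed algorithm.

```python
import numpy as np

def transition_matrix(symbols, alphabet_size):
    """Estimate the state transition matrix of a D = 1 Markov machine."""
    counts = np.zeros((alphabet_size, alphabet_size))
    # counts[l, k] = number of times state k directly follows state l.
    np.add.at(counts, (symbols[:-1], symbols[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Assumption: a state with no outgoing transitions gets a uniform row.
    return np.where(row_sums > 0,
                    counts / np.maximum(row_sums, 1.0),
                    1.0 / alphabet_size)

def sdf_feature(pi):
    """Left eigenvector of pi for the unity eigenvalue, scaled to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(pi.T)   # right eigvecs of pi.T = left of pi
    idx = np.argmin(np.abs(eigvals - 1.0))
    p = np.real(eigvecs[:, idx])
    return p / p.sum()

# Toy symbol sequence over a 3-symbol alphabet (assumption).
symbols = np.array([0, 1, 2, 1, 0, 1, 2, 2, 1, 0, 0, 1])
pi = transition_matrix(symbols, alphabet_size=3)
p = sdf_feature(pi)
```

The vector `p` satisfies `p @ pi == p` up to rounding and has length
$|\Sigma|$, which is why the feature dimension equals the alphabet size.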

3.  Sparse  representation  for  classification  (SRC) 

The SRC algorithm has been recently introduced28 and has shown promising
results in several classification applications such as robust face
recognition,28 visual tracking,29 and transient acoustic signal
classification.30 Here the SRC algorithm is reviewed. Consider a $C$-class
classification problem with $N$ training samples from $C$ different
classes. Let $N_c$, $c \in \{1, \ldots, C\}$, be the number of training
samples for the $c$th class and $n$ be the dimension of the feature vector.
Here $n$ is equal to the SDF alphabet size $|\Sigma|$. Also let
$x_{c,j} \in \mathbb{R}^n$ denote the $j$th training sample of the $c$th
class, where $j \in \{1, \ldots, N_c\}$. In the SRC algorithm, a dictionary
$X = [X_1 X_2 \cdots X_C] \in \mathbb{R}^{n \times N}$ is constructed by
stacking the training samples, where the sub-dictionary
$X_c = [x_{c,1}, x_{c,2}, \ldots, x_{c,N_c}] \in \mathbb{R}^{n \times N_c}$
consists of the training samples for the $c$th class, and
$N = \sum_{c=1}^{C} N_c$ is the total number of training samples. Given a
test sample $p \in \mathbb{R}^n$, it is classified based on its minimum
reconstruction error using the different classes. The underlying assumption
of SRC is that a test sample from the $c$th class lies (approximately)
within the subspace formed by the training samples of the $c$th class and
can be represented using a linear combination of a few training samples in
$X_c$. In other words, the test sample $p$ from the $c$th class can be
represented as

$$p = X\alpha + e, \tag{3}$$

where $\alpha$ is the coefficient vector whose entries are zero except for
some of the entries associated with the $c$th class, i.e.,
$\alpha = [0^T, \ldots, 0^T, \alpha_c^T, 0^T, \ldots, 0^T]^T$, and $e$ is a
small error/noise term due to the imperfection of the test and training
samples. For this reason, the algorithm seeks to obtain the sparse
coefficient vector $\alpha$ through the following $\ell_1$ optimization
problem:

$$\alpha^* = \arg\min_{\alpha} \|p - X\alpha\|_{\ell_2}^2 + \lambda \|\alpha\|_{\ell_1}, \tag{4}$$

with the $\ell_1$-norm defined as
$\|\alpha\|_{\ell_1} = \sum_{j=1}^{N} |\alpha_j|$ and $\lambda$ a
regularization parameter. The solution of the above optimization problem is
sparse provided that the regularization parameter $\lambda$ is chosen
sufficiently large. After the solution of the optimization problem is
found, the test sample $p$ is classified by comparing the reconstruction
errors of the different classes. For this purpose, let
$\delta_c(\alpha) \in \mathbb{R}^N$ be a vector whose only non-zero elements
are those entries in $\alpha$ that are associated with class $c$ in $X$,
i.e., $\delta_c(\alpha) = [0^T, \ldots, 0^T, \alpha_c^T, 0^T, \ldots, 0^T]^T$.
The label of the test data is then predicted using

$$c^* = \arg\min_{c} \|p - X\delta_c(\alpha^*)\|_{\ell_2}. \tag{5}$$
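
The two steps of SRC, sparse coding followed by class-wise reconstruction,
can be sketched as follows. This is a minimal illustration under stated
assumptions: the $\ell_1$ problem is solved with a plain proximal gradient
(ISTA) loop, which is a solver chosen here for brevity and not one named in
this chapter, and the function names, parameter values, and synthetic data
are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, p, lam, n_iter=500):
    """Minimize ||p - X a||_2^2 + lam * ||a||_1 by proximal gradient steps."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ a - p)
        a = soft_threshold(a - step * grad, step * lam)
    return a

def src_classify(X, labels, p, lam=0.01):
    """Return the class with the smallest class-wise reconstruction error."""
    a = lasso_ista(X, p, lam)
    classes = np.unique(labels)
    # Keep only the coefficients of class c, as delta_c does in the text.
    errors = [np.linalg.norm(p - X[:, labels == c] @ a[labels == c])
              for c in classes]
    return classes[int(np.argmin(errors))], a

# Synthetic stand-in data: two classes clustered around two unit vectors.
rng = np.random.default_rng(0)
n, per_class = 5, 4
protos = np.eye(n)[:, :2]
X = np.column_stack([protos[:, c] + 0.05 * rng.standard_normal(n)
                     for c in (0, 1) for _ in range(per_class)])
X /= np.linalg.norm(X, axis=0)                 # normalized training samples
labels = np.repeat([0, 1], per_class)

p = protos[:, 1] + 0.05 * rng.standard_normal(n)   # test sample near class 1
predicted, alpha = src_classify(X, labels, p / np.linalg.norm(p))
```

Because the whole optimization runs once per test sample, the cost of
`src_classify` grows with the number of columns of `X`.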


3.1.  Joint  sparse  representation  classification 

In many applications, including target classification, there are several
sources of information that should be fused to make an optimal
classification decision. It is well known that information fusion of
sensors can generally result in better situation awareness and decision
making.10 Many algorithms have been proposed for sensor fusion in the
classification problem. These methods can generally be categorized into two
sets: feature fusion12 and classifier fusion14 algorithms. Classifier fusion
algorithms aggregate the decisions from different classifiers that are
individually built on different sources. Methods of decision fusion include
majority vote,31 fuzzy logic,32 and statistical inference.11 For example, in
the context of sparse representation classification for target
classification, the reconstruction errors generated by using PIR and seismic
features individually can be combined, after proper normalization, for a
fused decision. This is also known as holistic sparse representation
classification (HSRC).33 While classifier fusion is a well-studied topic,
feature-level fusion is relatively less studied, specifically for fusing
heterogeneous sources of information, due to the incompatibility of feature
sets.34 Feature-level fusion using sparse representation has also been
recently introduced and has shown promising results.35-38

Among the different sparsity-based algorithms for information fusion, joint
sparse representation classification (JSRC) is probably the most cited
approach.21,30,39 In JSRC, multiple observations of a pattern using
different modalities are simultaneously represented by a few training
samples. Consider the $C$-class classification problem and let
$\mathcal{S} = \{1, \ldots, S\}$ be a finite set of available modalities.
Similar to the previous section, let $N = \sum_{c=1}^{C} N_c$ be the number
of training samples from each modality, where $N_c$ is the number of
training samples in the $c$th class. Also let $n_s$, $s \in \mathcal{S}$, be
the dimension of the feature vector for the $s$th modality and
$x^s_{c,j} \in \mathbb{R}^{n_s}$ denote the $j$th sample of the $s$th
modality that belongs to the $c$th class, $j \in \{1, \ldots, N_c\}$. In
JSRC, $S$ dictionaries
$X^s = [X^s_1 X^s_2 \cdots X^s_C] \in \mathbb{R}^{n_s \times N}$,
$s \in \mathcal{S}$, are constructed from the (normalized) training samples,
where the class-wise sub-dictionary
$X^s_c = [x^s_{c,1}, x^s_{c,2}, \ldots, x^s_{c,N_c}] \in \mathbb{R}^{n_s \times N_c}$
consists of samples from the $c$th class and $s$th modality.

Given a multimodal test sample $\{p^s\}$, where
$p^s \in \mathbb{R}^{n_s}$, $s \in \mathcal{S}$, the assumption is that the
test sample $p^s$ from the $c$th class lies approximately within the
subspace formed by the training samples of the $c$th class and can be
approximated (or reconstructed) from a few training samples in
$X^s_c$,28 similar to the SRC algorithm. In other words, if the test sample
$p^s$ belongs to the $c$th class, it is represented as:

$$p^s = X^s \alpha^s + e, \tag{6}$$

where $\alpha^s \in \mathbb{R}^N$ is a coefficient vector whose entries are
mostly zero except for some of the entries associated with the $c$th class,
i.e., $\alpha^s = [0^T, \ldots, 0^T, {\alpha^s_c}^T, 0^T, \ldots, 0^T]^T$, and
$e$ is a small error term due to the imperfection of the samples. In
addition, JSRC recognizes the relation between different modalities
representing the same event and enforces collaboration among them to make a
joint decision. This is achieved by constraining the coefficient vectors
from different modalities to have the same sparsity pattern. Consequently,
the same training samples from different modalities are used to reconstruct
the test data. To illustrate the idea, consider the target classification
application with PIR and seismic sensors. Let $p^1$ and $p^2$ be the feature
vectors extracted from the PIR and seismic sensors, respectively, and let
the multimodal test sample belong to the $c$th class. Using the idea of
sparse representation discussed in the previous section, the test samples
can be reconstructed using linear combinations of atoms in $X^1$ and $X^2$,
i.e., $p^1 = X^1\alpha^1 + e^1$ and $p^2 = X^2\alpha^2 + e^2$, where
$\alpha^1$ and $\alpha^2$ are the coefficient vectors whose entries are zero
except for some of the entries associated with the $c$th class, and $e^1$
and $e^2$ are small error terms. Let $I^1$ and $I^2$ be the index sets
corresponding to the non-zero rows of $\alpha^1$ and $\alpha^2$,
respectively. In the JSRC algorithm, it is further assumed that $p^1$ and
$p^2$ can be reconstructed from the same training samples in dictionaries
$X^1$ and $X^2$, with possibly different coefficients, because they belong
to the same event, and therefore $I^1 = I^2$. In other words,
$A = [\alpha^1, \alpha^2]$ is a row-sparse matrix with only a few non-zero
rows. In general, the coefficient matrix
$A = [\alpha^1, \ldots, \alpha^S] \in \mathbb{R}^{N \times S}$, where
$\alpha^s$ is the sparse coefficient vector for reconstructing $p^s$, is
recovered by solving the following $\ell_1/\ell_2$ joint optimization
problem:


$$\arg\min_{A = [\alpha^1, \ldots, \alpha^S]} f(A) + \lambda \|A\|_{\ell_1/\ell_2}, \tag{7}$$

where $f(A) = \frac{1}{2} \sum_{s=1}^{S} \|p^s - X^s\alpha^s\|_{\ell_2}^2$ is
the reconstruction error, $\lambda > 0$ is a regularization parameter, and
the $\ell_1/\ell_2$ norm is defined as
$\|A\|_{\ell_1/\ell_2} = \sum_{j=1}^{N} \|a_j\|_{\ell_2}$, in which the
$a_j$'s are the row vectors of $A$. The above optimization problem
encourages sharing of patterns across related observations, which results in
a solution $A$ that has a common support at the column level.21 The solution
can be obtained efficiently using the alternating direction method of
multipliers.40

Again, let $\delta_c(\alpha) \in \mathbb{R}^N$ be a vector indicator
function in which the rows corresponding to the $c$th class are retained and
the rest are set to zero. Similar to the SRC algorithm, the test data is
classified using the class-specific reconstruction errors as:

$$c^* = \arg\min_{c} \sum_{s=1}^{S} \|p^s - X^s\delta_c(\alpha^{s*})\|_{\ell_2}^2, \tag{8}$$

where the $\alpha^{s*}$'s are the optimal solutions of (7).
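
The joint problem (7) and the decision rule (8) can be illustrated with the
following Python sketch. Note that it swaps in a simple proximal gradient
loop with row-wise group soft-thresholding in place of the alternating
direction method of multipliers mentioned above; this choice, the function
names, the parameter values, and the synthetic two-modality data are all
assumptions of this example.

```python
import numpy as np

def row_soft_threshold(A, t):
    """Proximal operator of t * sum_j ||a_j||_2: shrink each row as a group."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return A * scale

def jsrc_solve(Xs, ps, lam, n_iter=500):
    """Minimize 0.5 * sum_s ||p^s - X^s a^s||_2^2 + lam * ||A||_{l1/l2}."""
    N, S = Xs[0].shape[1], len(Xs)
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2 for X in Xs)
    A = np.zeros((N, S))
    for _ in range(n_iter):
        grad = np.column_stack([X.T @ (X @ A[:, s] - p)
                                for s, (X, p) in enumerate(zip(Xs, ps))])
        A = row_soft_threshold(A - step * grad, step * lam)
    return A

def jsrc_classify(Xs, labels, ps, lam=0.05):
    """Pick the class with the smallest summed reconstruction error."""
    A = jsrc_solve(Xs, ps, lam)
    classes = np.unique(labels)
    errors = [sum(np.linalg.norm(p - X[:, labels == c] @ A[labels == c, s]) ** 2
                  for s, (X, p) in enumerate(zip(Xs, ps)))
              for c in classes]
    return classes[int(np.argmin(errors))], A

# Synthetic stand-in data: two modalities with feature dimensions 4 and 6.
rng = np.random.default_rng(0)
per_class, dims = 3, (4, 6)
labels = np.repeat([0, 1], per_class)

def make(n, c):
    """Unit-norm sample close to the c-th coordinate vector of R^n."""
    v = np.eye(n)[:, c] + 0.05 * rng.standard_normal(n)
    return v / np.linalg.norm(v)

Xs = [np.column_stack([make(n, c) for c in labels]) for n in dims]
ps = [make(n, 1) for n in dims]          # multimodal test sample from class 1
predicted, A = jsrc_classify(Xs, labels, ps)
```

Because each row of `A` is shrunk as a single group, a training sample is
either used by all modalities or discarded by all of them, which is the
shared sparsity pattern described above.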

3.2.  Tree-structured  sparse  representation  classification

The joint sparsity assumption of JSRC may be too stringent for applications
in which not all the modalities are equally important for classification.
Recently, tree-structured sparsity16,17 has been proposed to provide a
flexible framework for information fusion. It uses prior knowledge in
grouping different modalities by encoding them in a tree and allows
different modalities to be fused at multiple granularities. The leaf nodes
of the tree represent individual modalities and the internal nodes represent
different groupings of the modalities. A tree-structured set of groups of
modalities $\mathcal{G} \subseteq 2^{\mathcal{S}} \setminus \emptyset$ is
defined as a collection of subsets of the set of modalities $\mathcal{S}$
such that $\bigcup_{g \in \mathcal{G}} g = \mathcal{S}$ and
$\forall g, \tilde{g} \in \mathcal{G}$,
$(g \cap \tilde{g} \neq \emptyset) \Rightarrow ((g \subseteq \tilde{g}) \vee (\tilde{g} \subseteq g))$.
It is assumed here that $\mathcal{G}$ is ordered according to the relation
$\preceq$, which is defined as
$(g \preceq \tilde{g}) \Rightarrow ((g \subseteq \tilde{g}) \vee (g \cap \tilde{g} = \emptyset))$.
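
The two conditions in this definition are easy to check programmatically.
The sketch below is only an illustration: the function names are
hypothetical, groups are represented as Python sets of modality indices, and
sorting by group size is used as one simple ordering that is compatible with
the relation defined above.

```python
from itertools import combinations

def is_tree_structured(groups, modalities):
    """Check that the groups cover all modalities and are nested or disjoint."""
    groups = [frozenset(g) for g in groups]
    covers = frozenset().union(*groups) == frozenset(modalities)
    nested = all(not (g & h) or g <= h or h <= g
                 for g, h in combinations(groups, 2))
    return covers and nested

def tree_order(groups):
    """Order the groups so that every group comes after all of its subsets."""
    return sorted((frozenset(g) for g in groups), key=len)

# Three individual channels, their joint group, a fourth channel, and the root.
groups = [{1}, {2}, {3}, {1, 2, 3}, {4}, {1, 2, 3, 4}]
valid = is_tree_structured(groups, {1, 2, 3, 4})

# Overlapping groups that are not nested violate the definition.
invalid = is_tree_structured([{1, 2}, {2, 3}], {1, 2, 3})

ordered = tree_order(groups)
```

In a valid collection any two groups are either nested or disjoint, so the
groups form a tree whose root is the largest group.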

Given a tree-structured collection $\mathcal{G}$ of groups and a multimodal
test sample $\{p^s\}$, the tree-structured sparse representation
classification (TSRC) solves the following optimization problem:

$$\arg\min_{A = [\alpha^1, \ldots, \alpha^S]} f(A) + \lambda \Omega(A), \tag{9}$$

where $f(A)$ is defined the same as in Eq. (7), and the tree-structured
sparsity prior $\Omega(A)$ is defined as:

$$\Omega(A) = \sum_{j=1}^{N} \sum_{g \in \mathcal{G}} \omega_g \|a_{jg}\|_{\ell_2}. \tag{10}$$

In Eq. (10), $\omega_g$ is a positive weight for group $g$ and $a_{jg}$ is a
$(1 \times S)$ row vector whose coordinates are equal to the $j$th row of
$A$ for indices in the group $g$, and 0 otherwise. An accelerated algorithm
can be used to efficiently solve the optimization problem.17

The above optimization problem results in an $A^*$ that has a common
support at the group level, and the resulting sparsity depends on the
relative weights $\omega_g$ of the different groups.16 In the special case
where $\mathcal{G}$ consists of only one group containing all modalities,
Eq. (9) reduces to that of JSRC in Eq. (7). On the other hand, if
$\mathcal{G}$ consists of only singleton sets of individual modalities, no
common sparsity pattern is sought across the modalities and the optimization
problem of Eq. (9) reduces to $S$ separate optimization problems. The
tree-structured sparsity prior provides flexibility at the expense of the
need to select the weights $\omega_g$, $g \in \mathcal{G}$, a priori, which
is discussed in more detail in Section 5.
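
The two special cases can be verified numerically on a small example. The
sketch below evaluates the tree-structured prior for a given coefficient
matrix; the function name, the 0-based modality indices, the unit weights,
and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def tree_sparsity_prior(A, groups, weights):
    """Sum over rows j and groups g of w_g * ||a_{jg}||_2."""
    total = 0.0
    for g, w in zip(groups, weights):
        cols = sorted(g)                      # modalities (columns) in group g
        total += w * np.linalg.norm(A[:, cols], axis=1).sum()
    return total

# Toy coefficient matrix with N = 3 training samples and S = 2 modalities.
A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

# One group with all modalities: the l1/l2 norm used by JSRC.
joint = tree_sparsity_prior(A, [{0, 1}], [1.0])            # 5 + 0 + 1

# Singleton groups: the entrywise l1 norm, which decouples the modalities.
separate = tree_sparsity_prior(A, [{0}, {1}], [1.0, 1.0])  # 3 + 1 + 4
```

With a single group the penalty acts on whole rows, while with singleton
groups it acts on each entry separately, so each column of the coefficient
matrix can be optimized on its own.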

4.  Dictionary  learning 

In the sparsity-based classification algorithms discussed above (SRC, JSRC,
and TSRC), the dictionary is constructed by stacking the training samples.
In contrast to conventional classification algorithms, these algorithms do
not have a “training” step and most of the computation is performed at test
time, i.e., an optimization problem needs to be solved for each test
sample. This results in at least two difficulties. First, the optimization
problem becomes more and more expensive to solve as the number of training
samples increases. Second, the dictionary constructed by stacking the
training samples is optimal neither for reconstructive tasks41 nor for
discriminative tasks.18 Recently, it has been shown that dictionary learning
can overcome the above limitations and significantly improve the
performance in several applications, including image restoration,42 face
recognition,43 and object recognition.44,45 In contrast to principal
component analysis and its variants, dictionary learning algorithms
generally do not impose an orthogonality condition and are more flexible,
allowing them to be well tuned to the training data. Moreover, the learned
dictionaries are usually smaller than the number of training samples46,47
and are more appropriate for real-time applications. Dictionary learning
algorithms can generally be categorized into two groups, unsupervised and
supervised, which are discussed in the following sections.

4.1.  Unsupervised  dictionary  learning 

Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{n \times N}$ be the
collection of $N$ (normalized) training samples. In an unsupervised setting,
the dictionary $D \in \mathbb{R}^{n \times d}$ is obtained irrespective of
the class labels of the training samples by minimizing the following
reconstructive cost:43

$$g_N(D) \triangleq \frac{1}{N} \sum_{i=1}^{N} l_u(x_i, D), \tag{11}$$

over the regularizing convex set
$\mathcal{D} = \{D \in \mathbb{R}^{n \times d} \mid \|d_k\|_{\ell_2} \le 1, \forall k = 1, \ldots, d\}$,
where $d_k$ is the $k$th column, or atom, of the dictionary. The loss $l_u$
is defined as

$$l_u(x, D) = \min_{\alpha \in \mathbb{R}^d} \|x - D\alpha\|_{\ell_2}^2 + \lambda \|\alpha\|_{\ell_1}, \tag{12}$$

which is the optimal value of the sparse coding problem. For generalization
purposes, it is usually preferred to find the dictionary by minimizing an
expected risk rather than by perfectly minimizing the empirical cost.48 A
parameter-free online algorithm has also been proposed41 to find the
dictionary $D$ as the minimizer of the following stochastic cost over the
convex set $\mathcal{D}$:

$$g(D) \triangleq \mathbb{E}_x[l_u(x, D)], \tag{13}$$

where it is assumed that the data $x$ is drawn from a finite probability
distribution. The trained dictionary can then be integrated into the SRC
algorithm.
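
To make the structure of the empirical cost (11) concrete, the following
Python sketch alternates between sparse coding and a projected gradient
update of the dictionary. It is a simple batch scheme written for
illustration and is not the online algorithm cited above; the function
names, the iteration counts, the value of the regularization parameter, and
the random data are assumptions of this example.

```python
import numpy as np

def soft_threshold(V, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def cost(X, D, A, lam):
    """(1/N) * sum_i ||x_i - D a_i||_2^2 + lam * ||a_i||_1 for given codes A."""
    return (np.sum((X - D @ A) ** 2) + lam * np.abs(A).sum()) / X.shape[1]

def learn_dictionary(X, d, lam=0.1, n_outer=30, n_inner=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the atoms with randomly chosen training samples.
    D = X[:, rng.choice(X.shape[1], d, replace=False)].copy()
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    A = np.zeros((d, X.shape[1]))
    history = [cost(X, D, A, lam)]
    for _ in range(n_outer):
        # Sparse coding: ISTA steps on the codes with D fixed (warm start).
        step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12)
        for _ in range(n_inner):
            A = soft_threshold(A - step * 2.0 * D.T @ (D @ A - X), step * lam)
        # Dictionary update: one projected gradient step with the codes fixed.
        step_d = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
        D = D - step_d * 2.0 * (D @ A - X) @ A.T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)   # keep ||d_k||_2 <= 1
        history.append(cost(X, D, A, lam))
    return D, A, history

# Random unit-norm samples as a stand-in for the training features.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 60))
X /= np.linalg.norm(X, axis=0)

D, A, history = learn_dictionary(X, d=5)
```

The learned `D` has far fewer columns than `X`, so using it in place of the
stacked training samples reduces the size of the sparse coding problem that
must be solved for each test sample.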

The trained dictionary can also be used for feature extraction. In this
setting, the sparse code $\alpha^*$, generated as a solution of (12), is
used as a latent feature vector representing the input signal $x$ in the
classical expected risk optimization for training a classifier:19

$$\min_{w \in \mathcal{W}} \mathbb{E}_{y,x}[l(y, w, \alpha^*)] + \frac{\nu}{2} \|w\|_{\ell_2}^2, \tag{14}$$

where $y$ is the ground truth class label associated with the input $x$,
$w$ denotes the classifier parameters, $\nu$ is a regularizing parameter,
and $l$ is a convex loss function that measures how well one can predict $y$
given $\alpha^*$ and $w$. Note that in Eq. (14), the dictionary $D$ is
optimized independently of the classifier and class label.

4.2.  Supervised  dictionary  learning 

The dictionary trained in the unsupervised setting is not optimal for
classification. In a supervised formulation, the class labels can be used
to train a discriminative dictionary. In the most straightforward extension
of the unsupervised setting, the dictionary $D^*$ can instead be obtained by
learning class-specific sub-dictionaries $D^*_c$, $c = 1, \ldots, C$, in a
$C$-class classification problem using the formulation of Eq. (13) by
sampling the input from the corresponding $c$th class population. The
overall dictionary is then constructed as
$D^* = [D^*_1 \ldots D^*_C]$.43 The $C$ sub-dictionaries are trained
independently, so some of them may share similar atoms, which may adversely
affect the discrimination performance. An incoherence term has been
proposed to be added to the cost function to make the derived dictionaries
independent.49 In another formulation, a discriminative term is added to the
reconstruction error and the overall cost is minimized.18,50 More recently,
it has been shown that better performance can be obtained by learning the
dictionary in a task-driven formulation.19 In task-driven dictionary
learning, the optimal dictionary and classifier parameters are obtained
jointly by solving the following optimization problem:19

$$\min_{D \in \mathcal{D}, w \in \mathcal{W}} \mathbb{E}_{y,x}[l_{su}(y, w, \alpha^*(x, D))] + \frac{\nu}{2} \|w\|_{\ell_2}^2. \tag{15}$$

The learned task-driven dictionary has been shown to result in superior
performance compared to the unsupervised setting.19 In this setting, the
sparse codes are the optimized latent features for the classifier. While
the above dictionary learning algorithms are mostly developed for
single-modal scenarios, it has been shown that learning multimodal
dictionaries under structured sparsity priors can be beneficial.51,52

5.  Results 

This section presents the results of target classification on a real data
set, which was generated from field data using one passive infrared (PIR)
and three seismic sensors. The time series are classified to predict two
targets: (i) a human walking alone, and (ii) an animal led by a walking
human. These targets moved along an approximately 150-meter-long trail and
returned along the same trail to the starting point; all targets passed by
the sensor sites at a distance of approximately 5 meters. Signals from both
PIR and seismic sensors were acquired at a sampling frequency of 10 kHz and
each test was conducted over a period of approximately 50 seconds. The
subset of data used here consists of two days of data. Day 1 includes 47
human targets and 35 animal-led-by-human targets, while the corresponding
numbers for Day 2 are 32 and 34, respectively. A two-way cross-validation is
used to assess the performance of the classification algorithms, i.e., Day 1
data is used for training and Day 2 data for testing, and vice versa. The
reported results are the average of the results on the two test data sets.

The alphabet size $|\Sigma|$ for the SDF feature extraction algorithm is
chosen to be 30, which has been reported as optimal for target
classification in a previous study.9 The regularization parameters, and the
dictionary size in cases where dictionary learning is used, are chosen by
minimizing the classification cost on a small validation set.

For TSRC, the tree-structured set of groups is selected to be
$\mathcal{G} = \{g_1, g_2, g_3, g_4, g_5, g_6\} = \{\{1\}, \{2\}, \{3\}, \{1,2,3\}, \{4\}, \{1,2,3,4\}\}$,
where 1, 2, and 3 refer to the seismic channels and 4 refers to the PIR channel. The
relative weights for the different groups are chosen to satisfy
$w_{g_1} = w_{g_2} = w_{g_3} = w_{g_5} = 0.001\,w_{g_4}$ and $w_{g_6} = 10\,w_{g_4}$.
This leaves the tuning to be done using only one parameter, $w_{g_3}$, which is
selected using the validation set. The underlying assumption is that the three
seismic channels are correlated; therefore, the weight for group $g_4$ is
selected to be relatively large compared to the weights for the individual
seismic sources, to encourage joint sparsity among the three seismic channels.
The weight for $g_6$, however, is the largest, which encourages collaboration
among all sources. Although the correlations between different modalities
differ, the tree-structured sparsity allows collaboration at different
granularities. After the relative relations of the different weights are
selected a priori, the value of $w_{g_3}$ is chosen using cross validation. It
is noted that the results of the tree-structured sparsity are not sensitive to
the exact values of the weights, and similar results are obtained with different
sets of weights as long as the underlying prior information is correctly
formulated.
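
To make the group structure concrete, the sketch below evaluates a tree-structured penalty of the form $\sum_j \sum_{g \in \mathcal{G}} w_g \|A_{j,g}\|_2$ for a coefficient matrix whose rows index dictionary atoms and whose columns index the four channels. The exact form of the penalty, the helper names, and the leaf-weight value 0.001 are assumptions made for illustration; only the groups and the weight ratios are taken from the text.

```python
import math
from typing import Dict, Sequence, Tuple

# Channels 1-3 are seismic and channel 4 is PIR, as in the text.
GROUPS: Dict[str, Tuple[int, ...]] = {
    "g1": (1,), "g2": (2,), "g3": (3,),
    "g4": (1, 2, 3),
    "g5": (4,),
    "g6": (1, 2, 3, 4),
}


def group_weights(w_leaf: float) -> Dict[str, float]:
    """Derive all six weights from the single free parameter (a leaf weight)."""
    w_g4 = w_leaf / 0.001   # leaf weights equal 0.001 * w_g4
    w_g6 = 10.0 * w_g4      # the root group gets the largest weight
    return {"g1": w_leaf, "g2": w_leaf, "g3": w_leaf,
            "g4": w_g4, "g5": w_leaf, "g6": w_g6}


def tree_penalty(coeffs: Sequence[Sequence[float]],
                 weights: Dict[str, float]) -> float:
    """Sum over rows and groups of weight * l2 norm of the row restricted to the group."""
    total = 0.0
    for row in coeffs:
        for name, channels in GROUPS.items():
            norm = math.sqrt(sum(row[c - 1] ** 2 for c in channels))
            total += weights[name] * norm
    return total


weights = group_weights(w_leaf=0.001)  # hypothetical value of the free parameter

# Same number of nonzeros in both matrices, arranged differently.
shared_atom = [[1.0, 1.0, 1.0, 1.0],      # one atom used by every channel
               [0.0, 0.0, 0.0, 0.0]]
scattered_atoms = [[1.0, 0.0, 0.0, 0.0],  # a different atom for each channel
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]]

penalty_shared = tree_penalty(shared_atom, weights)
penalty_scattered = tree_penalty(scattered_atoms, weights)
print(penalty_shared, penalty_scattered)
```

Under these assumptions the shared-atom pattern incurs the smaller penalty, which is the sense in which large weights on $g_4$ and $g_6$ encourage the channels to select the same dictionary atoms.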

A straightforward way of utilizing the unsupervised and supervised
(single-modal) dictionary learning algorithms for multimodal classification
is to train independent dictionaries and classifiers for each modality and
then combine the individual scores for a fused decision. The corresponding
algorithms are referred to as UDL and SDL. This way of fusion is equivalent
to using the $\ell_{11}$ norm on $A$, instead of the $\ell_{12}$ norm, in Eq. (7); it does not
enforce row sparsity in the sparse coefficients and is a decision-level fusion
algorithm. The number of dictionary atoms for UDL and SDL is chosen to
be 20 (10 per class), and the quadratic cost19 is used. The dictionaries for
JSRC and JDSRC are constructed using all available training samples.
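
The difference between the two norms can be illustrated numerically. In the sketch below, the $\ell_{11}$ norm is taken to be the sum of the absolute values of all entries, and the $\ell_{12}$ norm the sum of the $\ell_2$ norms of the rows. These definitions and the helper names are assumptions consistent with the description above, since Eq. (7) is not reproduced here.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # rows: dictionary atoms, columns: modalities


def l11_norm(a: Matrix) -> float:
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in a for v in row)


def l12_norm(a: Matrix) -> float:
    """Sum over rows of the l2 norm of each row (promotes row sparsity)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in a)


def column_l1_norms(a: Matrix) -> list:
    """Per-modality l1 norms, one value per column."""
    return [sum(abs(row[j]) for row in a) for j in range(len(a[0]))]


shared_row = [[3.0, 4.0],   # both modalities use the same atom
              [0.0, 0.0]]
separate_rows = [[3.0, 0.0],  # each modality uses a different atom
                 [0.0, 4.0]]

print(l11_norm(shared_row), l11_norm(separate_rows))
print(l12_norm(shared_row), l12_norm(separate_rows))
```

The $\ell_{11}$ norm is identical for the two patterns and equals the sum of the per-column $\ell_1$ norms, so the problem decouples into one independent sparse coding problem per modality. The $\ell_{12}$ norm is smaller when the modalities share a row, which is what couples them.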

The performance of the presented classification algorithms under different
sparsity priors is also compared with that of several state-of-the-art
decision-level and feature-level fusion algorithms. For the decision-level
fusion algorithms, linear support vector machine (SVM)20 and logistic
regression (LR)20 classifiers are trained on the individual modalities, and the fused
decision is obtained by combining the scores of the individual modalities. These
approaches are abbreviated as SVM-Sum and LR-Sum, respectively. The
performance of the algorithms is also compared with feature-level fusion
methods, including the holistic sparse representation classifier (HSRC),33
the joint dynamic sparse representation classifier (JDSRC),33 relaxed
collaborative representation (RCR),38 and multiple kernel learning (MKL).53
HSRC is a simple modification of SRC for multimodal classification
in which the feature vectors from the different sources are concatenated into
a longer feature vector and SRC is applied to the concatenated feature
vectors. JDSRC and RCR have also recently been applied to a number of
applications, with better performance than JSRC. For the MKL algorithm, linear,
polynomial, and RBF kernels are used.
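
Decision-level fusion by score summation, as used for SVM-Sum and LR-Sum, can be sketched as follows. The per-modality scores are made-up numbers standing in for the outputs of classifiers trained separately on each modality, and `fuse_by_sum` is a hypothetical helper rather than part of any library.

```python
from typing import Dict, List


def fuse_by_sum(scores_per_modality: List[Dict[str, float]]) -> str:
    """Add the class scores of all modalities and return the top-scoring class."""
    fused: Dict[str, float] = {}
    for scores in scores_per_modality:
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + score
    return max(fused, key=fused.get)


# Illustrative scores only: three seismic channels followed by the PIR channel.
scores = [
    {"human": 0.60, "animal": 0.40},  # seismic channel 1
    {"human": 0.45, "animal": 0.55},  # seismic channel 2
    {"human": 0.40, "animal": 0.60},  # seismic channel 3
    {"human": 0.90, "animal": 0.10},  # PIR channel
]

decision = fuse_by_sum(scores)
print(decision)
```

In this toy example two channels favor each class, and the summed scores resolve the disagreement in favor of the class with the more confident votes.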

Table 1 summarizes the average human detection rate (HDR), human
false alarm rate (HFAR), and correct classification rate (CCR) obtained
using the different multimodal classification algorithms. As seen, the
sparsity-based feature-level fusion algorithms JDSRC, JSRC, and TSRC
perform better than the competing algorithms, with TSRC
achieving the best classification result. Moreover, the
structured sparsity prior has indeed improved performance compared
to the HSRC algorithm. While the SDL algorithm performs similarly
to its counterpart, the HSRC algorithm, it should be noted
that SDL enjoys more compact dictionaries. Developing multimodal
dictionaries using structured sparsity51 can potentially improve the results, which
is a topic of future research.

Table 1. Average human detection rate (HDR), human false alarm rate (HFAR),
and correct classification rate (CCR) of the multimodal classification algorithms.

Algorithm   HDR    HFAR   CCR
SVM-Sum     0.94   0.13   91.22%
LR-Sum      0.97   0.13   92.57%
RCR         0.94   0.12   91.22%
MKL         0.94   0.08   93.24%
HSRC        0.96   0.10   93.24%
JSRC        1.00   0.12   94.59%
JDSRC       0.97   0.08   94.59%
TSRC        1.00   0.09   95.65%
UDL         0.92   0.13   89.86%
SDL         0.94   0.08   93.24%
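
For reference, the three metrics reported in Table 1 can be computed from true and predicted labels as sketched below. The definitions used here are assumptions based on the usual meaning of the terms: HDR is the fraction of human targets classified as human, HFAR is the fraction of animal-led targets classified as human, and CCR is the overall accuracy. The label strings and the toy data are illustrative only.

```python
from typing import Dict, Sequence


def detection_metrics(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive: str = "human",
) -> Dict[str, float]:
    """Compute HDR, HFAR, and CCR, treating `positive` as the detected class."""
    pairs = list(zip(true_labels, predicted_labels))
    predictions_for_positives = [p for t, p in pairs if t == positive]
    predictions_for_negatives = [p for t, p in pairs if t != positive]
    hdr = sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)
    hfar = sum(p == positive for p in predictions_for_negatives) / len(predictions_for_negatives)
    ccr = sum(t == p for t, p in pairs) / len(pairs)
    return {"HDR": hdr, "HFAR": hfar, "CCR": ccr}


# Toy labels, not taken from the field data.
true_labels = ["human"] * 4 + ["animal"] * 4
predicted_labels = ["human", "human", "human", "animal",
                    "human", "animal", "animal", "animal"]

metrics = detection_metrics(true_labels, predicted_labels)
print(metrics)
```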

6.  Conclusions 

This chapter has presented an overview of symbolic dynamic filtering as a
feature extraction algorithm for time-series data. Recent developments
in the area of sparse representation for multimodal classification have also
been presented. For this purpose, structured sparsity priors, which enforce
collaboration across different modalities, were studied. In particular, the
tree-structured sparsity model allows extraction of cross-correlated information
among multiple modalities at different granularities. Finally, unsupervised
and supervised dictionary learning algorithms were reviewed. The
performance of the discussed algorithms was evaluated for the application of
target classification. The results show that the feature-level fusion algorithms
using sparsity models achieve superior performance compared to the
counterpart decision-level fusion algorithms. An interesting topic of future
research is the development of multimodal dictionary learning algorithms
under different structured sparsity priors.